An introduction to modern test theory

2023-11-28

```r
library(lavaan)
library(lavaanPlot)

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)
lavaanPlot(model = fit)
```
We want to measure something that is not directly accessible by using one or more proxy indicators.
Based on observed indicators (response data collected from participants), the latent variable is assigned a value for each respondent/participant. The value of the latent variable is the measurement. The indicators themselves are not measurements, they are indicators of a latent variable.
You will create a brief questionnaire to measure a unidimensional latent variable of your choice.
We’ll primarily look at the first two types, as they are by far the most used in psychology and psychometrics.
Johansson, M., Preuter, M., Karlsson, S., Möllerberg, M.-L., Svensson, H., & Melin, J. (2023). Valid and Reliable? Basic and Expanded Recommendations for Psychometric Reporting and Quality Assessment. OSF Preprints. https://doi.org/10.31219/osf.io/3htzc
We need a reasonable psychometric analysis to justify the use of a sum score! And even then, the “sum score” is a debatable metric in itself, since it is ordinal data that often gets treated as interval data in statistical models.
We will look at each of these in more detail during this lecture.
| Criterion | Description |
|---|---|
| Unidimensionality | Items represent one latent variable, without strongly correlated item residuals. Principal Component Analysis and Exploratory Factor Analysis of raw data are exploratory methods. |
| Ordered response categories | A higher person location (sum score) on the latent variable should entail an increased probability of a higher response (category) for all items, and vice versa. Sometimes referred to as 'monotonicity'. |
| Invariance | Item and measure properties are consistent between relevant demographic groups (gender, age, ethnicity, time, etc.). Test-retest correlation is not an invariance test since it does not provide information about item properties. |
| Targeting | Item (threshold) locations should be well matched to person locations, without ceiling or floor effects or large gaps. |
| Reliability | Sufficient reliability for the expected properties of the target population and intended use of results. Reliability is contingent on the other criteria being fulfilled and should not be reported for scales with inadequate properties. |
CTT = factor analysis, principal component analysis, etc
Modern test theory = IRT, Rasch, Mokken, etc
Major differences:
No psychometric/statistical method is “safe” from misuse. Some common mistakes to look for in papers:
Let’s dive into Item Response Theory and Rasch Measurement Theory!
| I1 | I2 | I3 | I4 | I5 | I6 | I7 | I8 | I9 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
The assumed basic structure of the data is that there is a systematic pattern across items and participants of an increased probability of correct responses as the latent ability increases. The Rasch/IRT 1PL model can be described as a probabilistic Guttman scale (Andrich 1985).
This figure shows items and persons sorted based on the number of correct responses (colored blue). You can see the gradual shift from lower left to upper right that shows the Guttman pattern.
This is a key figure in understanding IRT and the concept of item difficulty/location. The point on the x-axis where the line crosses 0.5 is the item difficulty (item location). This is the threshold when the probability of a correct response becomes higher than the probability of an incorrect response. ICC = item characteristic curve.
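The ICC just described can be sketched in a couple of lines of R. `plogis()` is base R's logistic CDF; the item location `b` below is a made-up value for illustration.

```r
# Rasch/1PL item characteristic curve:
# P(correct) = exp(theta - b) / (1 + exp(theta - b)) = plogis(theta - b)
icc_1pl <- function(theta, b) plogis(theta - b)

b <- 0.8                       # hypothetical item location (difficulty)
icc_1pl(theta = b, b = b)      # exactly 0.5 when theta equals the item location
icc_1pl(theta = b + 1, b = b)  # above 0.5 once ability exceeds the location
```

Plotting `icc_1pl()` over a range of theta values reproduces the S-shaped curve in the figure.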
Nine dichotomous items in the same plot.
This figure illustrates how the items are ordered (“item hierarchy”) and the locations on the latent variable where they provide the most information - the “item threshold” - which is the location of probability 0.5 for each dichotomous item.
This is another key figure, derived from a “Wright map”. The points in the bottom part show the thresholds for each item. The top histogram shows the distribution of the latent variable for the participants (person locations/abilities). The middle section aggregates the item thresholds to help visualize how well the items fit the persons.
Since items and persons are on the same scale, we can infer a person's item responses from their latent variable score. Let's take a person with a score of 0 as an example.
Which items would this person score correctly?
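Under the model, the question can be answered directly: items located below the person's theta have a success probability above 0.5. A minimal sketch, using made-up item locations:

```r
# hypothetical item locations on the logit scale
item_locations <- c(I1 = -2, I2 = -1, I3 = -0.5, I4 = 0.5, I5 = 1.5)
theta <- 0  # the person's location

p_correct <- plogis(theta - item_locations)
# items easier than the person (location < theta) have P(correct) > 0.5
names(item_locations)[p_correct > 0.5]  # "I1" "I2" "I3"
```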
We have looked at the Rasch model for dichotomous data, also sometimes (not quite correctly) referred to as the IRT 1PL model. The key parameter is the item location/difficulty. 1PL stands for “one parameter logistic” model, and the parameter is the item location.
I generally use the term “location” for both items and persons as it is more generic. But for didactic purposes, when speaking of ability tests, it is probably easier to think of the specific term “difficulty” for items and how it relates to a person's latent “ability”.
In IRT terminology, “person location” is frequently referred to as “theta”, often using this symbol: \(\theta\)
IRT/Rasch uses the logit scale, which is an interval scale, for both items and persons. This means that a difference of, say, one logit represents the same distance on the latent variable regardless of where on the scale it occurs.
However, the values on the logit scale have no inherent meaning or external reference point. This is why we need to look at the item difficulty and person ability in relation to each other. Do not conflate the zero point on the logit scale with something like 0 on a Z-score scale.
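The logit is simply the log of the odds. In base R, `qlogis()` computes it and `plogis()` inverts it; note that zero on the scale just corresponds to even odds (p = 0.5), not to any population mean.

```r
# logit(p) = log(p / (1 - p)) -- the log-odds
qlogis(0.5)   # 0: even odds sit at the scale's (arbitrary) zero point
qlogis(0.73)  # ~0.99, about one logit
plogis(1)     # back-transform: ~0.73
```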
Let’s return to the items you created before.
The 2PL and 3PL models are also commonly used. The 2PL model adds a second parameter, the item discrimination, which is a measure of how well the item separates between persons with high and low ability.
The 3PL model adds a third parameter, which makes the ICC figure look like this for example.
Can you guess what the third parameter is?
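The three dichotomous models nest inside one function. In this sketch, `a` is the 2PL discrimination and `c` the 3PL lower asymptote (the “guessing” parameter); all parameter values below are made up for illustration.

```r
# 1PL: P = plogis(theta - b)                       (location b only)
# 2PL: P = plogis(a * (theta - b))                 (adds discrimination a)
# 3PL: P = c + (1 - c) * plogis(a * (theta - b))   (adds lower asymptote c)
icc_3pl <- function(theta, b, a = 1, c = 0) {
  c + (1 - c) * plogis(a * (theta - b))
}

icc_3pl(theta = -10, b = 0, a = 1.5, c = 0.25)  # floor near c: guessing still succeeds
icc_3pl(theta = 0,   b = 0, a = 1.5, c = 0.25)  # at b: c + (1 - c)/2, no longer 0.5
```

With the defaults `a = 1, c = 0` the function reduces to the 1PL curve, which is why the 1PL/2PL/3PL naming counts parameters.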
Now let’s move on to questionnaire-type data with ordered response categories.
We’ll focus on the Rasch model, in part since it is complicated enough for this short lecture. But also since it is the only model that allows the ordinal sum score to be used as a sufficient statistic for the latent variable.
This is the same type of figure as before, and it now shows probabilities for all response categories for one item. How do you interpret it?
Recall \(\theta\)? Let’s say a person responds “Often” to this item; in which range is their theta most likely to be?
This is an item from the Perceived Stress Scale (PSS).
Is this person’s theta a high or low value? How does it relate to the overall population?
q8n: “found that you could not cope with all the things that you had to do?” - “Sometimes”
q2n: “felt that you were unable to control important things in your life?” - “Seldom”
See how the error bar gets smaller and smaller as we add more items? That’s because we’re getting more and more information about the respondent’s theta with each item.
Here are all 7 items from the PSS negative items (Rozental, Forsström, and Johansson 2023). Discuss 2 & 2 how you interpret this figure.
Why is this figure of interest?
Test information - reflects item properties, not sample/person properties.
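For dichotomous Rasch items, test information at a given theta is the sum of p(1 - p) over items, and the standard error of measurement is 1/sqrt(information), so adding items can only shrink the error bar. A sketch with hypothetical item locations:

```r
# Fisher information for dichotomous Rasch items: I(theta) = sum of p * (1 - p)
test_info <- function(theta, item_locations) {
  p <- plogis(theta - item_locations)
  sum(p * (1 - p))
}
sem <- function(theta, item_locations) 1 / sqrt(test_info(theta, item_locations))

items <- c(-1, -0.5, 0, 0.5, 1)  # hypothetical item locations
sem(0, items)                    # standard error at theta = 0 with 5 items
sem(0, c(items, items))          # doubling the items shrinks the SEM
```

Note that the information function depends only on the item locations, which is why test information reflects item rather than sample properties.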
We have been using the Partial Credit Model with Conditional Maximum Likelihood estimation, using the eRm package for R.
When analyzing polytomous data with Rasch/IRT models, the lowest response category is always set to 0, since the score reflects the number of thresholds “passed” by the respondent. Think of this in relation to the dichotomous model, where 0 and 1 are the only scores available and the sum score is just a count of the number of items with correct responses. In the polytomous case, the sum score is a count of the number of thresholds passed per item, summed together.
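The “thresholds passed” view of the polytomous sum score can be illustrated in base R. The responses below are made up; the model itself would be fitted with `eRm::PCM()` as in the lecture examples.

```r
# hypothetical responses to three items with 5 categories scored 0-4
responses <- c(item1 = 2, item2 = 0, item3 = 4)

# a response of k means k thresholds were passed on that item,
# so the sum score is the total number of thresholds passed
thresholds_passed <- sum(responses)
thresholds_passed  # 6 of a possible 12 (3 items x 4 thresholds each)

# fitting the Partial Credit Model to a full response matrix:
# fit <- eRm::PCM(response_matrix)
```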
We could put any set of items into a Rasch model and just look at ICC curves and targeting and estimate thetas all we want. But that is no better than “sum score and alpha”.
While multidimensional constructs are possible, it is outside the scope of this lecture. Most often, even unidimensional measures are not well constructed. The most common problem (imho) is residual correlations. We’ll look at four ways to assess unidimensionality in a Rasch model:
We’ll use the same published dataset and paper analyzing the PSS-14 scale as before (Rozental, Forsström, and Johansson 2023) as an example for dimensionality analysis.
We use multiple tests since there is no single test to establish unidimensionality. This is also true for CTT analysis.
Does everyone know what residuals are?
“Outfit” refers to item fit when the person location is far from the item location, while “infit” refers to when person and item locations are close together. MSQ should be close to 1, with lower and upper cutoffs often set at 0.7 and 1.3, while ZSTD should be around 0, with cutoffs set at +/- 1.96. Infit is usually the more important of the two. Low fit values indicate a better-than-expected fit to the Rasch model; this can inflate reliability without adding much information. High fit values often reflect multidimensionality.
| OutfitMSQ | InfitMSQ | OutfitZSTD | InfitZSTD | |
|---|---|---|---|---|
| q1n | 0.873 | 0.871 | -1.576 | -1.595 |
| q2n | 0.946 | 0.937 | -0.704 | -1.039 |
| q3n | 0.878 | 0.881 | -1.692 | -1.601 |
| q4p | 0.889 | 0.887 | -1.4 | -1.385 |
| q5p | 0.912 | 0.914 | -1.154 | -1.235 |
| q6p | 0.947 | 0.938 | -0.545 | -1.083 |
| q7p | 0.981 | 0.982 | -0.159 | -0.127 |
| q8n | 0.936 | 0.94 | -0.844 | -0.655 |
| q9p | 1.036 | 1.034 | 0.409 | 0.434 |
| q10p | 1.033 | 1.025 | 0.286 | 0.294 |
| q11n | 0.927 | 0.922 | -0.97 | -0.924 |
| q12n | 0.826 | 0.844 | -1.796 | -1.665 |
| q13p | 1.027 | 1.027 | 0.56 | 0.471 |
| q14n | 1.027 | 1.018 | 0.363 | 0.374 |
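For dichotomous data, the mean-square statistics can be sketched directly from the residuals: outfit is the unweighted mean of squared standardized residuals, infit the information-weighted mean. The observed responses and expected probabilities below are made up for illustration.

```r
# toy observed responses and model-expected probabilities for one item
x <- c(1, 0, 1, 1, 0)             # observed responses across persons
p <- c(0.8, 0.3, 0.6, 0.9, 0.2)   # expected P(correct) under the model
w <- p * (1 - p)                  # binomial variance = information weight

z2 <- (x - p)^2 / w               # squared standardized residuals
outfit_msq <- mean(z2)            # unweighted: sensitive to far-off persons
infit_msq <- sum(w * z2) / sum(w) # weighted: sensitive to well-targeted persons
c(outfit = outfit_msq, infit = infit_msq)
```

In practice these come straight from the fitted model (e.g. `eRm::itemfit()` on the person parameter object), but the weighting is what separates the two columns in the table above.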
The highest eigenvalue should be below 2.0
| Eigenvalue |
|---|
| 6.38 |
| 1.43 |
| 0.84 |
| 0.80 |
| 0.69 |
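Eigenvalue tables like this come from a principal component analysis of the model residuals; given a person-by-item residual matrix, the computation is a one-liner. The residual matrix below is simulated pure noise, so its eigenvalues should all be near 1 (no extra dimension).

```r
set.seed(1)
# simulated person-by-item residual matrix: unstructured noise
resid_matrix <- matrix(rnorm(500 * 5), nrow = 500, ncol = 5)

# eigenvalues of the residual correlation matrix
eigenvalues <- eigen(cor(resid_matrix))$values
round(eigenvalues, 2)  # all close to 1 when residuals carry no structure
```

A large first eigenvalue, by contrast, signals that the residuals still share systematic variance, i.e. a possible second dimension.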
| q1n | q2n | q3n | q4p | q5p | q6p | q7p | q8n | q9p | q10p | q11n | q12n | q13p | q14n | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| q1n | ||||||||||||||
| q2n | 0.53 | |||||||||||||
| q3n | 0.4 | 0.53 | ||||||||||||
| q4p | -0.33 | -0.43 | -0.35 | |||||||||||
| q5p | -0.37 | -0.52 | -0.44 | 0.58 | ||||||||||
| q6p | -0.33 | -0.54 | -0.49 | 0.5 | 0.64 | |||||||||
| q7p | -0.4 | -0.52 | -0.44 | 0.39 | 0.46 | 0.51 | ||||||||
| q8n | 0.21 | 0.42 | 0.42 | -0.28 | -0.38 | -0.39 | -0.39 | |||||||
| q9p | -0.47 | -0.46 | -0.4 | 0.42 | 0.42 | 0.38 | 0.39 | -0.32 | ||||||
| q10p | -0.38 | -0.58 | -0.52 | 0.39 | 0.48 | 0.5 | 0.53 | -0.49 | 0.42 | |||||
| q11n | 0.42 | 0.35 | 0.3 | -0.29 | -0.35 | -0.29 | -0.34 | 0.14 | -0.47 | -0.27 | ||||
| q12n | 0.14 | 0.27 | 0.36 | -0.12 | -0.17 | -0.22 | -0.17 | 0.35 | -0.19 | -0.28 | 0.06 | |||
| q13p | -0.2 | -0.45 | -0.36 | 0.27 | 0.36 | 0.35 | 0.31 | -0.46 | 0.24 | 0.49 | -0.17 | -0.3 | ||
| q14n | 0.31 | 0.54 | 0.51 | -0.43 | -0.52 | -0.57 | -0.5 | 0.47 | -0.37 | -0.6 | 0.25 | 0.25 | -0.5 | |
| Note: | ||||||||||||||
| Relative cut-off value (highlighted in red) is 0.172, which is 0.2 above the average correlation. |
We clearly have two clusters of items. Can you spot the pattern?
A higher person location on the latent variable should entail an increased probability of a higher response (category) for all items and vice versa. This is sometimes referred to as ‘monotonicity’.
We can check this by looking at the item characteristic curves (ICC). So far, we have only seen ICCs with ordered response categories. We will look at an example with disordered response categories.
This is by far the most common type of data in psychology and social sciences. But it is extremely common to pretend that ordinal data is interval data, and use it as such in everything from simple calculations such as mean/SD to more complex statistical models. This is a problem (Liddell and Kruschke 2018) and there are methods for analyzing ordinal data properly (Bürkner and Vuorre 2019).
Adding more response categories does not make the data anything other than ordinal, and neither does removing the labels. Visual Analogue Scales have no strong case for being interval level either. All three approaches tend to add to problems with disordered response categories.
We’ll use an open dataset (Didino et al. 2019) of the Flourishing Scale (Diener et al. 2010). It has 8 items:
| itemnr | item |
|---|---|
| flourish1 | I lead a purposeful and meaningful life. |
| flourish2 | My social relationships are supportive and rewarding. |
| flourish3 | I am engaged and interested in my daily activities. |
| flourish4 | I actively contribute to the happiness and well-being of others. |
| flourish5 | I am competent and capable in the activities that are important to me. |
| flourish6 | I am a good person and live a good life. |
| flourish7 | I am optimistic about my future. |
| flourish8 | People respect me. |
All items share the same set of 7 response categories:
| Response | Ordinal |
|---|---|
| Strongly disagree | 0 |
| Disagree | 1 |
| Slightly disagree | 2 |
| Mixed or neither agree nor disagree | 3 |
| Slightly agree | 4 |
| Agree | 5 |
| Strongly agree | 6 |
Rozental, A., Forsström, D., & Johansson, M. (2023). A psychometric evaluation of the Swedish translation of the Perceived Stress Scale: A Rasch analysis. BMC Psychiatry, 23(1), 690. https://doi.org/10.1186/s12888-023-05162-4
There are guides and tools available for doing Bayesian IRT in R as well. The brms package is highly recommended (Bürkner 2017, 2021) and this excellent blog post: https://solomonkurz.netlify.app/blog/2021-12-29-notes-on-the-bayesian-cumulative-probit/
All measurements have some degree of error/uncertainty.
Even when measuring concrete and directly accessible things, like the weight of an object, there is ALWAYS a margin of error in the measurement.